****Delete this notes section eventually****
Our github repo
ProPublica main article
ProPublica methodology
ProPublica github
Raw version of COMPAS survey
Columbia article on Risk as proxy for race (2010) Article on Risk, Race, and Recidivism
Original publication by COMPAS creator
Assignment instructions on Canvas
Project description on Canvas
****End of team notes****
#Suppressing warnings about libraries
#Import libraries (currently overdoing it)
library(tidyverse)
library(ggplot2) # graphics library
library(gridExtra) # For displaying graphs side-by-side
library(knitr) # contains knitting control
library(tree) # For the tree-fitting 'tree' function
library(randomForest) # For random forests
library(rpart) # For nicer tree fitting
library(partykit) # For nicer tree plotting
library(boot) # For cv.glm
library(leaps) # needed for regsubsets (though maybe not relevant b/c our outcome vars are binary)
library(plotly)
library(rsample) # data splitting, just trying to see if works (for naive bayes)
library(dplyr) # data transformation, just trying to see if works (naive bayes)
library(caret) # naive bayes package
library(h2o) # naive bayes package
library(MASS) # For LDA
#Format numbers so that they are not in scientific notation.
options(scipen = 4)
In 2016, ProPublica published a story on how a commonly-used pre-trial risk assessment tool called the COMPAS is racially biased. The journalists showed that although a 2009 validation study reported similar accuracy rates for black and white men (67 percent versus 69 percent), the inaccuracies ran in opposite directions for the two groups. This racial bias of the tool is reflected in their published table, replicated below:
| Type of error | White | African American |
|---|---|---|
| Labeled Higher Risk, But Didn’t Re-Offend | 23.5% | 44.9% |
| Labeled Lower Risk, Yet Did Re-Offend | 47.7% | 28.0% |
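The two error rates in the table can be computed directly from predictions and outcomes. A base-R sketch using a toy data frame (the columns high_risk, reoffended, and race are illustrative stand-ins for the dataset's score and recidivism variables):

```r
# Among people who did NOT re-offend, the share labeled higher risk
# (false positives); among people who DID re-offend, the share labeled
# lower risk (false negatives) -- each split by race.
toy <- data.frame(
  race       = c("White", "White", "White", "Black", "Black", "Black"),
  high_risk  = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE),
  reoffended = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)
)
fp.rate <- with(subset(toy, !reoffended), tapply(high_risk, race, mean))
fn.rate <- with(subset(toy, reoffended), tapply(!high_risk, race, mean))
```

On the real data, the same two tapply() calls applied to the cleaned dataset would reproduce the table's percentages.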
The COMPAS is widely used across states [add more here from ProPublica article]. It gives people a risk score ranging from 1 to 10, where scores of 1 to 4 are labeled “Low”, 5 to 7 “Medium”, and 8 to 10 “High.” Although race is not among its 137 questions, some of the questions, such as how often a person has moved, can be linked to poverty…[add more here from]
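The decile-to-category mapping described above can be sketched with base R's cut() (a minimal illustration with a hypothetical function name, not ProPublica's code):

```r
# Map the 1-10 decile scores to the Low/Medium/High labels described above.
decile_to_text <- function(decile_score) {
  cut(decile_score,
      breaks = c(0, 4, 7, 10),
      labels = c("Low", "Medium", "High"))
}

decile_to_text(c(1, 4, 5, 7, 8, 10))
# maps to: Low Low Medium Medium High High
```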
(A) Using the available data, construct a Risk Assessment Instrument (RAI) for predicting two-year recidivism.
(B) Create an RAI for predicting violent recidivism.
(C) Assess whether the RAIs from (A) and (B) are equally predictive across race/ethnicity groups, as well as across age and gender groups.
(D) Compare your RAIs to the COMPAS RAI.
Our goal is to investigate whether it is possible to create a Risk Assessment Instrument (RAI) that is more accurate and less racially biased than the COMPAS.
This analysis uses ProPublica’s data. To obtain clean data, it is necessary to remove certain observations. Fortunately, ProPublica staff published their Jupyter notebook with data-cleaning steps in R, so our data is cleaned in the same way (e.g., dropping people whose charge date was not within 30 days of their COMPAS screening). However, although we drop the same observations (rows) as they do, we have adapted their code so that it does not drop any attributes (columns), because discarding attributes up front would be bad practice for data mining.
###################################
# Read in all ProPublica datasets #
###################################
compas.scores.raw <- read_csv("datasets/compas-scores-raw.csv")
compas.scores <- read_csv("datasets/compas-scores.csv")
## Warning: Duplicated column names deduplicated: 'decile_score' =>
## 'decile_score_1' [45]
compas.scores.two.years <- read_csv("datasets/compas-scores-two-years.csv")
## Warning: Duplicated column names deduplicated: 'decile_score' =>
## 'decile_score_1' [40], 'priors_count' => 'priors_count_1' [49]
compas.scores.two.years.violent <- read_csv("datasets/compas-scores-two-years-violent.csv")
## Warning: Duplicated column names deduplicated: 'decile_score' =>
## 'decile_score_1' [40], 'priors_count' => 'priors_count_1' [49], 'two_year_recid'
## => 'two_year_recid_1' [54]
cox.parsed <- read_csv("datasets/cox-parsed.csv")
## Warning: Duplicated column names deduplicated: 'decile_score' =>
## 'decile_score_1' [40], 'priors_count' => 'priors_count_1' [49]
cox.violent.parsed <- read_csv("datasets/cox-violent-parsed.csv")
## Warning: Duplicated column names deduplicated: 'decile_score' =>
## 'decile_score_1' [40], 'priors_count' => 'priors_count_1' [49]
#Next step 1 (low LOE): Somebody fix so the file paths aren't just for my local directories. Here are three options in order of how ideal they would be:
# The propublica github files
# My forked version of the propublica github files: https://github.com/kaylareiman1/compas-analysis
# Our copied and pasted version of the propublica github files: https://github.com/kaylareiman1/RecidivismPrediction/tree/master/datasets
####################################
# Adapt ProPublica's Data Cleaning #
####################################
#Note: This code is copied directly from ProPublica. Here is why they dropped data:
# - Dropped rows where the charge date was not within 30 days of the COMPAS screening.
# - Coded the recidivist flag -- is_recid -- as -1 if no COMPAS case could be found at all.
# - Removed ordinary traffic offenses.
# - Filtered the underlying Broward County data to include only people who either:
#   + recidivated within two years, or
#   + had at least two years outside of a correctional facility.
#Unlike ProPublica, we won't be dropping any attributes, because that may bias our model.
### All recidivism ###
#Dropping bad data
recid.all <-compas.scores.two.years %>%
filter(days_b_screening_arrest <= 30) %>%
filter(days_b_screening_arrest >= -30) %>%
filter(is_recid != -1) %>%
filter(c_charge_degree != "O") %>%
filter(score_text != 'N/A')
#Recoding variables
recid.all$length_of_stay <- as.numeric(as.Date(recid.all$c_jail_out) - as.Date(recid.all$c_jail_in))
recid.all <- mutate(recid.all, crime_factor = factor(c_charge_degree)) %>%
mutate(age_factor = as.factor(age_cat)) %>%
within(age_factor <- relevel(age_factor, ref = 1)) %>%
mutate(race_factor = factor(race)) %>%
within(race_factor <- relevel(race_factor, ref = 3)) %>%
mutate(gender_factor = factor(sex, labels= c("Female","Male"))) %>%
within(gender_factor <- relevel(gender_factor, ref = 2)) %>%
mutate(score_factor = factor(score_text != "Low", labels = c("LowScore","HighScore")))
### Violent recidivism ###
#Dropping bad data
recid.vio <- compas.scores.two.years.violent %>%
filter(days_b_screening_arrest <= 30) %>%
filter(days_b_screening_arrest >= -30) %>%
filter(is_recid != -1) %>%
filter(c_charge_degree != "O") %>%
filter(v_score_text != 'N/A')
#Recoding variables
#KR note: Why is race coded differently? Is there a mistake?
recid.vio <- mutate(recid.vio, crime_factor = factor(c_charge_degree)) %>%
mutate(age_factor = as.factor(age_cat)) %>%
within(age_factor <- relevel(age_factor, ref = 1)) %>%
mutate(race_factor = factor(race,
labels = c("African-American",
"Asian",
"Caucasian",
"Hispanic",
"Native American",
"Other"))) %>%
within(race_factor <- relevel(race_factor, ref = 3)) %>%
mutate(gender_factor = factor(sex, labels= c("Female","Male"))) %>%
within(gender_factor <- relevel(gender_factor, ref = 2)) %>%
mutate(score_factor = factor(v_score_text != "Low", labels = c("LowScore","HighScore")))
####################################
# Split the data?????????????????? #
####################################
#Per Saturday's call, maybe this is where we should randomly split the data. If so, we should use a seed so it's the same every time.
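A seeded split, as the note above suggests, could look like this minimal base-R sketch (the 80/20 proportion and the seed value are illustrative, not a team decision):

```r
# Reproducible train/validation split: the seed makes sample() return
# the same row indices on every run.
set.seed(1)
n <- 100                                        # stand-in for nrow(recid.all)
train.idx <- sample(seq_len(n), size = floor(0.8 * n))
train.rows <- seq_len(n) %in% train.idx         # logical mask for the training set
table(train.rows)
```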
[KR note: must check whether 1 person per row or different row driver]
This project uses 2 datasets from ProPublica’s github repository, which are described in the table below:
| Dataset name | Original obs. | Obs. after cleaning | # variables |
|---|---|---|---|
| recid.all | 7214 | 6172 | 59 |
| recid.vio | 4743 | 4020 | 59 |
Out of the 59 variables in each dataset, we first looked at the variables in the logistic regression that ProPublica ran to see which characteristics predicted being flagged as medium/high risk. Their code is pasted below:
glm(formula = score_factor ~ gender_factor + age_factor + race_factor + priors_count + crime_factor + two_year_recid, family = "binomial", data = df)
An overview of the variables that ProPublica used is below: [KR note: this dictionary could be better.]
- two_year_recid: the variable that ProPublica used to indicate recidivism in each dataset.
- score_factor: a binary variable indicating low risk vs. medium/high risk.
- gender_factor: male vs. female.
- age_factor: a 3-level categorical variable for age.
- race_factor: derived from the original race variable. As will be shown below, most categories have very few observations.
- crime_factor: felonies vs. misdemeanors.
- priors_count: count of prior offenses.
A list of all of the variables in the dataset is shown below:
#It'll be ideal if we can delete this section eventually (or make it prettier). However, I'm leaving this here so you can get to know the data better without running everything from scratch.
colnames.all <- colnames(recid.all)
colnames.vio <- colnames(recid.vio)
#Print full list of columns
colnames.all
## [1] "id" "name"
## [3] "first" "last"
## [5] "compas_screening_date" "sex"
## [7] "dob" "age"
## [9] "age_cat" "race"
## [11] "juv_fel_count" "decile_score"
## [13] "juv_misd_count" "juv_other_count"
## [15] "priors_count" "days_b_screening_arrest"
## [17] "c_jail_in" "c_jail_out"
## [19] "c_case_number" "c_offense_date"
## [21] "c_arrest_date" "c_days_from_compas"
## [23] "c_charge_degree" "c_charge_desc"
## [25] "is_recid" "r_case_number"
## [27] "r_charge_degree" "r_days_from_arrest"
## [29] "r_offense_date" "r_charge_desc"
## [31] "r_jail_in" "r_jail_out"
## [33] "violent_recid" "is_violent_recid"
## [35] "vr_case_number" "vr_charge_degree"
## [37] "vr_offense_date" "vr_charge_desc"
## [39] "type_of_assessment" "decile_score_1"
## [41] "score_text" "screening_date"
## [43] "v_type_of_assessment" "v_decile_score"
## [45] "v_score_text" "v_screening_date"
## [47] "in_custody" "out_custody"
## [49] "priors_count_1" "start"
## [51] "end" "event"
## [53] "two_year_recid" "length_of_stay"
## [55] "crime_factor" "age_factor"
## [57] "race_factor" "gender_factor"
## [59] "score_factor"
#Print columns that differ
colnames.vio[(colnames.vio != colnames.all)]
## [1] "two_year_recid_1"
colnames.all[(colnames.vio != colnames.all)]
## [1] "length_of_stay"
Given the multiple possibilities for recidivism outcome variables, the tables below show the checks that were run to determine that two_year_recid is the correct outcome variable to use in both datasets. [KR note: maybe delete this later. But for now, it would be great if somebody can confirm that you agree with this conclusion, or decide whether we should create a new variable for violent recidivism.]
#Same comment as above: It'll be ideal if we can delete this section eventually (or make it prettier). However, I'm leaving this here so you can get to know the data better without running everything from scratch.
#Check how the recidivism variables relate
kable(recid.all %>%
group_by (two_year_recid, is_recid, is_violent_recid, violent_recid) %>%
summarize(count = n()), caption = "How variables relate in recid.all dataset")
| two_year_recid | is_recid | is_violent_recid | violent_recid | count |
|---|---|---|---|---|
| 0 | 0 | 0 | NA | 3182 |
| 0 | 1 | 0 | NA | 141 |
| 0 | 1 | 1 | NA | 40 |
| 1 | 1 | 0 | NA | 2157 |
| 1 | 1 | 1 | NA | 652 |
kable(recid.vio %>%
group_by (two_year_recid, two_year_recid_1, is_recid, is_violent_recid, violent_recid) %>%
summarize(count = n()), caption = "How variables relate in recid.vio dataset")
| two_year_recid | two_year_recid_1 | is_recid | is_violent_recid | violent_recid | count |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | NA | 3187 |
| 0 | 0 | 1 | 0 | NA | 141 |
| 0 | 0 | 1 | 1 | NA | 40 |
| 1 | 1 | 1 | 1 | NA | 652 |
Note: These graphs use the full dataset (not limited to violent crime). The univariate graphs below show that most people in the dataset were men, African-American, and between the ages of 25 and 45.
#These three graphs are identical, except with different input variables.
gender_factor.uni.all <- ggplot(recid.all, aes(x=gender_factor))
gender_factor.uni.all +
geom_bar(stat = "count", na.rm = F) +
ggtitle("Univariate Gender Breakdown") +
guides(fill = FALSE) +
xlab("Gender") +
ylab("Number of People")
race_factor.uni.all <- ggplot(recid.all, aes(x=race_factor))
race_factor.uni.all +
geom_bar(stat = "count", na.rm = F) +
ggtitle("Univariate Race Breakdown") +
guides(fill = FALSE) +
xlab("Race") +
ylab("Number of People")
age_factor.uni.all <- ggplot(recid.all, aes(x=age_factor))
age_factor.uni.all +
geom_bar(stat = "count", na.rm = F) +
ggtitle("Univariate Age Breakdown") +
guides(fill = FALSE) +
xlab("Age") +
ylab("Number of People")
The table below shows how common two-year recidivism of both types was in our dataset.
rate.all <- recid.all %>%
summarize(100*round(mean(two_year_recid), 2))
rate.vio <- recid.vio %>%
summarize(100*round(mean(two_year_recid), 2))
| Dataset | Two-year Rate |
|---|---|
| All recidivism (recid.all) | 46% |
| Violent recidivism (recid.vio) | 16% |
The graphs below show participants’ 10-point decile scores, where a score of 10 indicates the highest predicted recidivism risk. Judges were typically shown the categories of low, medium, and high risk. ProPublica’s analysis combined the medium and high categories to create a binary variable called score_factor.
#Create histograms of decile scores for all crime, color coded by the category.
ranking.all <- ggplot(recid.all, aes(x=decile_score, fill = score_text)) +
geom_bar(stat = "count") +
ggtitle("Prediction Metrics\nAll data") +
xlab("Prediction") +
guides(fill = FALSE) +
ylab("Number of people")
#Note that v_score_text and v_decile_score are different from decile_score and score_text
ranking.vio <- ggplot(recid.vio, aes(x=v_decile_score, fill = v_score_text)) +
geom_bar(stat = "count") +
ggtitle("Prediction Metrics\nViolence data") +
xlab("Prediction")+
ylab(NULL)
grid.arrange(ranking.all, ranking.vio, ncol = 2)
[KR note to team: I actually expected these wouldn’t look alike at all. But by the time I finished this, I’d spent so much time adapting code from our good ol’ NLSY project that now I want to keep them.] Initial data exploration showed that trends in actual recidivism were similar to the predictions. However, a few limitations should be noted:
- The higher rate of recidivism among African-American respondents cannot be separated from bias in policing [cite Michelle Alexander].
- These charts include no interactions between demographic characteristics (e.g., race and gender).
- As ProPublica pointed out, these charts do not differentiate between false positives and false negatives.
Additionally, we already know from the univariate race graphs that the recidivism rates for groups other than White, Hispanic, and African-American people will be distorted by a small sample size. [KR note: perhaps we should drop these obs or create a new race var?]
#Note -- just copying and pasting same charts for different characteristics. Could create a function instead. Also, note that I only ran this for recid.all because running it for recid.vio seemed like too many charts.
### gender_factor ###
# Bivariate w/ decile score
gender_factor.bi.conf.all <- recid.all %>%
group_by(gender_factor) %>%
summarize(Avgdecile_score = mean(decile_score),
lower = t.test(decile_score)$conf.int[1],
upper = t.test(decile_score)$conf.int[2])
gender_factor.bi.all.dec <- ggplot(data = gender_factor.bi.conf.all, aes(x = gender_factor, y = Avgdecile_score)) + geom_bar(stat = "identity") +
xlab("gender_factor") +
ylab("Average prediction") +
ggtitle("Prediction from COMPAS") +
guides(fill = FALSE) +
geom_errorbar(aes(ymax = upper, ymin = lower), width = 0.1, size = 1) +
theme(text = element_text(size=12))
# Bivariate w/ recid
gender_factor.bi.conf.all <- recid.all %>%
group_by(gender_factor) %>%
summarize(Avgtwo_year_recid = mean(two_year_recid),
lower = t.test(two_year_recid)$conf.int[1],
upper = t.test(two_year_recid)$conf.int[2])
gender_factor.bi.all.recid <- ggplot(data = gender_factor.bi.conf.all, aes(x = gender_factor, y = Avgtwo_year_recid)) +
geom_bar(stat = "identity") +
xlab("gender_factor") +
ylab("Average Rate") +
ggtitle("Actual 2-year Recidivism") +
guides(fill = FALSE) +
geom_errorbar(aes(ymax = upper, ymin = lower), width = 0.1, size = 1) +
theme(text = element_text(size=12))
### RACE ###
# Bivariate w/ decile score
race_factor.bi.conf.all <- recid.all %>%
group_by(race_factor) %>%
summarize(Avgdecile_score = mean(decile_score),
lower = t.test(decile_score)$conf.int[1],
upper = t.test(decile_score)$conf.int[2])
race_factor.bi.all.dec <- ggplot(data = race_factor.bi.conf.all, aes(x = race_factor, y = Avgdecile_score)) + geom_bar(stat = "identity") +
xlab("Race") +
ylab("Average prediction") +
ggtitle("Prediction from COMPAS") +
guides(fill = FALSE) +
geom_errorbar(aes(ymax = upper, ymin = lower), width = 0.1, size = 1) +
theme(text = element_text(size=12), axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1))
# Bivariate w/ recid
race_factor.bi.conf.all <- recid.all %>%
group_by(race_factor) %>%
summarize(Avgtwo_year_recid = mean(two_year_recid),
lower = t.test(two_year_recid)$conf.int[1],
upper = t.test(two_year_recid)$conf.int[2])
race_factor.bi.all.recid <- ggplot(data = race_factor.bi.conf.all, aes(x = race_factor, y = Avgtwo_year_recid)) +
geom_bar(stat = "identity") +
xlab("Race") +
ylab("Average Rate") +
ggtitle("Actual 2-year Recidivism") +
guides(fill = FALSE) +
geom_errorbar(aes(ymax = upper, ymin = lower), width = 0.1, size = 1) +
theme(text = element_text(size=12), axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1))
### AGE ###
# Bivariate w/ decile score
age_factor.bi.conf.all <- recid.all %>%
group_by(age_factor) %>%
summarize(Avgdecile_score = mean(decile_score),
lower = t.test(decile_score)$conf.int[1],
upper = t.test(decile_score)$conf.int[2])
age_factor.bi.all.dec <- ggplot(data = age_factor.bi.conf.all, aes(x = age_factor, y = Avgdecile_score)) + geom_bar(stat = "identity") +
xlab("Age") +
ylab("Average prediction") +
ggtitle("Prediction from COMPAS") +
guides(fill = FALSE) +
geom_errorbar(aes(ymax = upper, ymin = lower), width = 0.1, size = 1) +
theme(text = element_text(size=12))
# Bivariate w/ recid
age_factor.bi.conf.all <- recid.all %>%
group_by(age_factor) %>%
summarize(Avgtwo_year_recid = mean(two_year_recid),
lower = t.test(two_year_recid)$conf.int[1],
upper = t.test(two_year_recid)$conf.int[2])
age_factor.bi.all.recid <- ggplot(data = age_factor.bi.conf.all, aes(x = age_factor, y = Avgtwo_year_recid)) +
geom_bar(stat = "identity") +
xlab("Age") +
ylab("Average Rate") +
ggtitle("Actual 2-year Recidivism") +
guides(fill = FALSE) +
geom_errorbar(aes(ymax = upper, ymin = lower), width = 0.1, size = 1) +
theme(text = element_text(size=12))
#Display the predictions versus actual recidivism rate averages
grid.arrange(gender_factor.bi.all.dec, gender_factor.bi.all.recid, ncol = 2)
grid.arrange(race_factor.bi.all.dec, race_factor.bi.all.recid, ncol = 2)
grid.arrange(age_factor.bi.all.dec, age_factor.bi.all.recid, ncol = 2)
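As the chunk comment above notes, the six copy-and-pasted summarize() blocks could be collapsed into one function. A base-R sketch of the idea (the name group_ci is hypothetical; the charts themselves would still be built with ggplot):

```r
# For each level of a grouping variable, compute the mean of an outcome
# and a 95% confidence interval from t.test() -- the same quantities that
# the repeated summarize() calls compute.
group_ci <- function(df, group, outcome) {
  do.call(rbind, lapply(split(df[[outcome]], df[[group]]), function(v) {
    ci <- t.test(v)$conf.int
    data.frame(mean = mean(v), lower = ci[1], upper = ci[2])
  }))
}

toy <- data.frame(g = rep(c("a", "b"), each = 5),
                  y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
group_ci(toy, "g", "y")  # one row per group, with mean and CI bounds
```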
Prior to starting the data mining process, we randomly split the data into training and validation sets.
#Somebody figure out how we're supposed to do this. I thought I understood it until we learned about cross-validation, which is better than regular validation. Also, should we do it here or before EDA?
2a.1: Model Selection and validation
Possible models
[KR note to team: we may cut some of these if we a) don’t have time or b) think they’re the wrong models]
Given our task of predicting binary outcomes, we considered the following classification methods:
- Logistic regression
- Naive Bayes (NB): assumes the predictors are conditionally independent given the class.
- Linear Discriminant Analysis (LDA): assumes the predictors share a common covariance structure across classes.
- Quadratic Discriminant Analysis (QDA): allows each class its own covariance structure.
- Random Forest (RF)
- Classification Tree: although less reliable than a random forest, the potential for public transparency makes this worth trying.
- K-Nearest Neighbors (KNN)
Sormeh Attempts Logistic Regression
## Going to try with all of the factors for now, and then I will be more selective
## working with dataset = recid.all
#glm.fits = glm(two_year_recid ~age_factor+race_factor+gender_factor+crime_factor+age, data=recid.all, family=binomial)
glm.fits = glm(two_year_recid ~age_factor+race_factor+gender_factor+crime_factor+priors_count, data=recid.all, family=binomial)
#glm.fits = glm(two_year_recid ~ race_factor+gender_factor+crime_factor+age, data=recid.all, family=binomial)
summary(glm.fits)
##
## Call:
## glm(formula = two_year_recid ~ age_factor + race_factor + gender_factor +
## crime_factor + priors_count, family = binomial, data = recid.all)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7430 -0.9690 -0.6501 1.0858 2.0686
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.608309 0.064512 -9.429 < 2e-16 ***
## age_factorGreater than 45 -0.668217 0.076095 -8.781 < 2e-16 ***
## age_factorLess than 25 0.733479 0.068937 10.640 < 2e-16 ***
## race_factorAfrican-American 0.095986 0.062700 1.531 0.125800
## race_factorAsian -0.550754 0.427799 -1.287 0.197950
## race_factorHispanic -0.170372 0.108815 -1.566 0.117419
## race_factorNative American -0.270741 0.650668 -0.416 0.677339
## race_factorOther -0.154898 0.127679 -1.213 0.225061
## gender_factorFemale -0.348609 0.071851 -4.852 0.00000122 ***
## crime_factorM -0.218718 0.058759 -3.722 0.000197 ***
## priors_count 0.165543 0.008068 20.518 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8506.4 on 6171 degrees of freedom
## Residual deviance: 7576.2 on 6161 degrees of freedom
## AIC: 7598.2
##
## Number of Fisher Scoring iterations: 4
## To just see the coefficients
summary(glm.fits)$coef[,4]
## (Intercept) age_factorGreater than 45
## 4.122705e-21 1.614884e-18
## age_factorLess than 25 race_factorAfrican-American
## 1.943911e-26 1.258000e-01
## race_factorAsian race_factorHispanic
## 1.979497e-01 1.174193e-01
## race_factorNative American race_factorOther
## 6.773393e-01 2.250612e-01
## gender_factorFemale crime_factorM
## 1.223076e-06 1.974087e-04
## priors_count
## 1.477025e-93
## Create random training set and testing set
## adding a column of random 0 and 1s (seeded so the split is reproducible)
set.seed(1)
recid.all$is.test <- sample(c(0,1), size = nrow(recid.all), replace=TRUE)
## Partition recid.all data
recid.all.train <- subset(recid.all, subset = is.test == 0)
recid.all.test <- subset(recid.all, subset = is.test == 1)
## Predict function to predict the probability that someone will have recidivated in 2 years
## type="response" tells R to output probabilities of the form P(Y=1|X)
print("Predicting two_year_recid:")
## [1] "Predicting two_year_recid:"
glm.probs = predict(glm.fits, type="response")
glm.probs[1:10]
## 1 2 3 4 5 6 7 8
## 0.1928770 0.3746489 0.7075113 0.2725100 0.8467446 0.4337475 0.2358377 0.3524450
## 9 10
## 0.6222399 0.2358377
#contrasts(as.factor(recid.all$two_year_recid)) ## contrasts() needs a factor; the bare column name isn't in scope and two_year_recid is numeric
p <- ggplot(data = recid.all, aes(two_year_recid, fill = as.factor(glm.fits$y))) +
geom_histogram(binwidth = 1)
ggplotly(p)
## Trying again but just with training data
## Predict function to predict the probability that someone will have recidivated in 2 years
## type="response" tells R to output probabilities of the form P(Y=1|X)
print("Predicting two_year_recid **Training Data**:")
## [1] "Predicting two_year_recid **Training Data**:"
glm.logit <- glm(two_year_recid ~ age_factor+race_factor+gender_factor+crime_factor+priors_count, data=recid.all.train, family = binomial)
summary(glm.logit)
##
## Call:
## glm(formula = two_year_recid ~ age_factor + race_factor + gender_factor +
## crime_factor + priors_count, family = binomial, data = recid.all.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5637 -0.9931 -0.6260 1.0495 2.1180
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.45019 0.09047 -4.976 6.49e-07 ***
## age_factorGreater than 45 -0.71728 0.10876 -6.595 4.25e-11 ***
## age_factorLess than 25 0.78313 0.09819 7.976 1.52e-15 ***
## race_factorAfrican-American 0.03289 0.08905 0.369 0.712
## race_factorAsian -0.67250 0.59312 -1.134 0.257
## race_factorHispanic -0.07728 0.15398 -0.502 0.616
## race_factorNative American -0.26959 0.79542 -0.339 0.735
## race_factorOther -0.39688 0.18175 -2.184 0.029 *
## gender_factorFemale -0.52317 0.10231 -5.113 3.17e-07 ***
## crime_factorM -0.36282 0.08333 -4.354 1.34e-05 ***
## priors_count 0.15273 0.01096 13.938 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4248.7 on 3080 degrees of freedom
## Residual deviance: 3749.5 on 3070 degrees of freedom
## AIC: 3771.5
##
## Number of Fisher Scoring iterations: 4
kable(coef(summary(glm.logit)), digits= 3)
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -0.450 | 0.090 | -4.976 | 0.000 |
| age_factorGreater than 45 | -0.717 | 0.109 | -6.595 | 0.000 |
| age_factorLess than 25 | 0.783 | 0.098 | 7.976 | 0.000 |
| race_factorAfrican-American | 0.033 | 0.089 | 0.369 | 0.712 |
| race_factorAsian | -0.672 | 0.593 | -1.134 | 0.257 |
| race_factorHispanic | -0.077 | 0.154 | -0.502 | 0.616 |
| race_factorNative American | -0.270 | 0.795 | -0.339 | 0.735 |
| race_factorOther | -0.397 | 0.182 | -2.184 | 0.029 |
| gender_factorFemale | -0.523 | 0.102 | -5.113 | 0.000 |
| crime_factorM | -0.363 | 0.083 | -4.354 | 0.000 |
| priors_count | 0.153 | 0.011 | 13.938 | 0.000 |
glm.probs = predict(glm.logit, type="response")
glm.probs[1:10]
## 1 2 3 4 5 6 7 8
## 0.3971614 0.2297209 0.8439626 0.4039866 0.2081385 0.6132963 0.2081385 0.1768393
## 9 10
## 0.6574151 0.2910505
#contrasts(as.factor(recid.all.train$two_year_recid)) ## contrasts() needs a factor; the bare column name isn't in scope and two_year_recid is numeric
p <- ggplot(data = recid.all.train, aes(two_year_recid, fill = as.factor(glm.logit$y))) +
geom_histogram(binwidth = 1)
ggplotly(p)
#spam.logit
glm.probs = predict.glm(glm.logit, type="response")
confusion.glm = ifelse(glm.probs > 0.5, "1", "0")
conf.train.pred = table(confusion.glm, recid.all.train$two_year_recid)
conf.train.pred
##
## confusion.glm 0 1
## 0 1286 594
## 1 386 815
#recid.test.logit <- glm(two_year_recid ~ age_factor+race_factor+gender_factor+crime_factor+age, data=recid.all.test, family=binomial)
#Note: this refits the model on the test set rather than predicting from the
#training fit, so the "test" error below is not a true out-of-sample estimate.
recid.test.logit <- glm(two_year_recid ~ age_factor+race_factor+gender_factor+crime_factor+priors_count, data=recid.all.test, family=binomial)
glm.probs = predict.glm(recid.test.logit, type="response")
conf.test.glm = ifelse(glm.probs > 0.5, "1", "0")
conf.test.pred = table(conf.test.glm, recid.all.test$two_year_recid)
conf.test.pred
##
## conf.test.glm 0 1
## 0 1328 652
## 1 363 748
# Misclassification
print("Train Misclassification:")
## [1] "Train Misclassification:"
1- sum(diag(conf.train.pred)) / sum(conf.train.pred)
## [1] 0.3180785
print("Test Misclassification:")
## [1] "Test Misclassification:"
1-sum(diag(conf.test.pred)) / sum(conf.test.pred)
## [1] 0.3283727
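A true out-of-sample test error would come from scoring the training fit on the held-out rows via predict()'s newdata argument, rather than refitting on the test set. A minimal self-contained sketch with simulated data (the names fit, probs, and test.error are illustrative):

```r
# Fit on a training subset, then evaluate on rows the model never saw.
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(x))   # simulated binary outcome
df <- data.frame(x = x, y = y)
train <- df[1:150, ]
test  <- df[151:200, ]

fit <- glm(y ~ x, data = train, family = binomial)
probs <- predict(fit, newdata = test, type = "response")  # key step: newdata
pred <- ifelse(probs > 0.5, 1, 0)
test.error <- mean(pred != test$y)
```

The same pattern applies here: predict(glm.logit, newdata = recid.all.test, type = "response") would score the training-set model on the test rows.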
### COME BACK TO ME!!!
print("Come back here for editing...trying to plot the additive logistic models from homework 4 problem 3")
## [1] "Come back here for editing...trying to plot the additive logistic models from homework 4 problem 3"
#recid.formula <- formula(paste("two_year_recid ~ ", paste("s(", varinfo[,1], ", 4)", sep = "", collapse= " + ")))
# Edit me
#recid.gam <- gam(recid.formula, data=recid.all.train, family = binomial)
#summary(recid.gam)
#par(mfrow=c(15,4))
#plot(recid.gam, se=T, col="ForestGreen")
Sormeh Attempts Naive Bayes
Could not find how we did this in lab, so following the tutorial at uc-r.github.io/naive_bayes.
## Create training (70%) and test (30%) sets for the attrition data
## Use set.seed for reproducibility
set.seed(123)
split <- initial_split(recid.all, prop = .7, strata = "two_year_recid")
train <- training(split)
test <- testing(split)
## Distribution of two_year_recid rates across train and test set
print("Training set distribution:")
## [1] "Training set distribution:"
table(train$two_year_recid) %>% prop.table
##
## 0 1
## 0.5448866 0.4551134
print("Testing set distribution:")
## [1] "Testing set distribution:"
table(test$two_year_recid) %>% prop.table
##
## 0 1
## 0.5448649 0.4551351
Bayesian probability incorporates the concept of conditional probability: the probability of event A given that event B has occurred, denoted P(A|B). In the context of our data, we are seeking the probability that an individual belongs to two_year_recid = 1, given that individual’s predictor values. Here’s a plot that attempts to show correlation among the variables; it is different from the way we’ve typically done it in class.
## I guess this is a different way of seeing if variables are correlated with one another??
train %>%
filter(two_year_recid == 1) %>%
select_if(is.numeric) %>%
cor() %>%
corrplot::corrplot()
## Warning in stats::cor(x, ...): the standard deviation is zero
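The “standard deviation is zero” warning above means at least one numeric column is constant within the filtered data (for example, two_year_recid itself is all 1s after filtering to two_year_recid == 1). Dropping zero-variance columns before cor() silences it; a base-R sketch with a hypothetical helper name:

```r
# Keep only numeric columns with nonzero standard deviation, so cor()
# has no constant columns to warn about.
drop_constant <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  num[vapply(num, function(v) sd(v, na.rm = TRUE) > 0, logical(1))]
}

toy <- data.frame(a = c(1, 2, 3), b = c(5, 5, 5), g = c("x", "y", "z"))
names(drop_constant(toy))  # "b" is constant and "g" is non-numeric, so only "a" survives
```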
Using the caret package for Naive Bayes. Running into trouble here…
##The tutorial separates the data into a response vector and feature data
features <- setdiff(names(train), "two_year_recid")
x <- train[, features]
y <- train$two_year_recid
#caret's train() treats a numeric 0/1 response as regression; method = "nb"
#needs a factor response for classification:
#y <- as.factor(y)
## set up 10-fold cross validation procedure
train_control <- trainControl(method = "cv", number = 10)
## train model
#nb.m1 <- train(x = x, y = y, method = "nb", trControl = train_control)
## results
#confusionMatrix(nb.m1)
Moses work:
#Choosing variables
recid.subsets <- regsubsets(as.factor(two_year_recid) ~
crime_factor +
age +
age_factor +
race_factor +
gender_factor +
log2(priors_count + 1) +
priors_count +
juv_fel_count +
(juv_fel_count > 0) +
juv_misd_count +
(juv_misd_count > 0) +
juv_other_count,
data = recid.all,
nbest = 1,
nvmax = 15,
method = "exhaustive")
plot(x = summary(recid.subsets)$adjr2, xlab = "Number of Variables", ylab = "Adj. R-Squared")
maxr2 <- which.max(summary(recid.subsets)$adjr2)
points(maxr2, summary(recid.subsets)$adjr2[maxr2], col = "red", cex = 1, pch = 20)
LDA work
#LDA
#crime_factor, race_factor, and juv_other_count slightly improve the model & increase specificity (at the expense of sensitivity) - should we include them??
#
#without: acc = 67.8, sens = 60.6, spec = 73.9 DOMINATED BY EXCL. RACE, EXCL. CRIME
#all three: acc = 68.1, sens = 60.4, spec = 74.5 Probably overfit
#exclude race_factor: acc = 68.1, sens = 60.7, spec = 74.3 Highest sensitivity, accuracy
#exclude juv_other_count: acc = 67.9, sens = 60.0, spec = 74.5 DOMINATED BY INCL. JUV, ALL THREE
#exclude crime_factor: acc = 68.0, sens = 60.6, spec = 74.2 DOMINATED BY EXCL. RACE
#include race_factor: acc = 67.9, sens = 60.0, spec = 74.4 DOMINATED BY EXCL. JUV, ALL THREE
#include juv_other_count: acc = 68.0, sens = 60.1, spec = 74.6 Strong, high-specificity
#include crime_factor: acc = 67.9, sens = 59.8, spec = 74.7 Highest specificity
recid.lda <- lda(two_year_recid ~
#crime_factor +
age +
(age_factor == "Less than 25") +
#race_factor +
gender_factor +
#juv_other_count +
log2(priors_count + 1),
data = recid.all)
#subset = train.all)
#To predict a held-out set rather than the training data, pass it via newdata,
#e.g. predict(recid.lda, newdata = test) (assuming a held-out data frame named test).
#Note that predict() on an lda fit ignores a type = "response" argument; it always
#returns class labels ($class) and posterior probabilities ($posterior).
recid.lda.pred <- predict(recid.lda)
lda.tab = table(predicted = recid.lda.pred$class,
actual = recid.all$two_year_recid)
lda.tab
## actual
## predicted 0 1
## 0 2484 1106
## 1 879 1703
lda.acc <- (lda.tab[1,1] + lda.tab[2,2]) / sum(lda.tab)
lda.sens <- lda.tab[2,2] / sum(lda.tab[,2])
lda.spec <- lda.tab[1,1] / sum(lda.tab[,1])
tibble(Accuracy = scales::percent(lda.acc, accuracy = 0.1),
Sensitivity = scales::percent(lda.sens, accuracy = 0.1),
Specificity = scales::percent(lda.spec, accuracy = 0.1))
## # A tibble: 1 x 3
## Accuracy Sensitivity Specificity
## <chr> <chr> <chr>
## 1 67.8% 60.6% 73.9%
recid.qda <- qda(two_year_recid ~
#crime_factor +
age +
(age_factor == "Less than 25") +
gender_factor +
log2(priors_count + 1),
data = recid.all)
recid.qda.pred <- predict(recid.qda) #as with lda, type = "response" is ignored here
qda.tab = table(predicted = recid.qda.pred$class,
actual = recid.all$two_year_recid)
qda.tab
## actual
## predicted 0 1
## 0 2329 982
## 1 1034 1827
qda.acc <- (qda.tab[1,1] + qda.tab[2,2]) / sum(qda.tab)
qda.sens <- qda.tab[2,2] / sum(qda.tab[,2])
qda.spec <- qda.tab[1,1] / sum(qda.tab[,1])
tibble(Accuracy = scales::percent(qda.acc, accuracy = 0.1),
Sensitivity = scales::percent(qda.sens, accuracy = 0.1),
Specificity = scales::percent(qda.spec, accuracy = 0.1))
## # A tibble: 1 x 3
## Accuracy Sensitivity Specificity
## <chr> <chr> <chr>
## 1 67.3% 65.0% 69.3%
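One answer to the sensitivity-versus-specificity question in our notes: `predict()` on an lda or qda fit returns posterior probabilities, so we can move the classification cutoff away from 0.5 instead of changing the variable list. A self-contained sketch on synthetic stand-in data (the feature, outcome, and the 0.3 cutoff are all arbitrary choices for illustration):

```r
library(MASS)

#Synthetic stand-in for the real data: one feature, binary outcome
set.seed(3)
n <- 1000
x <- rnorm(n)
recid <- rbinom(n, 1, plogis(1.2 * x))
d <- data.frame(recid = factor(recid), x = x)

fit  <- lda(recid ~ x, data = d)
post <- predict(fit)$posterior[, "1"]   #P(recid = 1 | x)

#Lowering the cutoff flags more people as high-risk:
#sensitivity rises while specificity falls
metrics_at <- function(cut) {
  pred <- as.numeric(post > cut)
  c(sensitivity = mean(pred[recid == 1] == 1),
    specificity = mean(pred[recid == 0] == 0))
}
metrics_at(0.5)
metrics_at(0.3)
```

Sweeping the cutoff over a grid and plotting the two metrics against each other would trace out the full tradeoff curve for any of our models.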
We then decided to pursue [which models should we use?]
#Run these models on training data
#Run cross-validation on these models
#Select the best version of each model based on a low CV
#Update after meeting: could we actually decide based on high specificity/sensitivity, or is that not legit?
#Output line graphs showing how training and CV error levels work (I think this is a crucial part of the prompt)
#Update after meeting: perhaps first run w/o including race, but then in the discussion section, re-run the final models including race and see if they become less racist?
To decide between models, we looked at three measures of error:

- Accuracy: the percent of predictions that were correct. It is a limited measure of the tool's usefulness because it counts false positives and false negatives the same way, even though their human impact is very different.
- Specificity: the true negative rate. A higher number shows that we are not mistakenly classifying people as high-risk. In general, we believe this is the most important measure because it keeps innocent people from receiving harsh punishments.
- Sensitivity: the true positive rate. A higher number shows that we successfully predicted who would re-offend. For violent crime, a high sensitivity rate is important.
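Since we compute these three measures for every model, the repeated arithmetic from the LDA and QDA chunks can be factored into one helper (the function name is ours, not from a package; the example table reuses the LDA confusion matrix reported above):

```r
#Compute accuracy, sensitivity, and specificity from a 2x2 confusion table
#laid out as table(predicted, actual) with levels 0 (no recid) and 1 (recid),
#matching the lda.tab/qda.tab tables above
classification_metrics <- function(tab) {
  stopifnot(all(dim(tab) == c(2, 2)))
  c(accuracy    = (tab[1, 1] + tab[2, 2]) / sum(tab),
    sensitivity = tab[2, 2] / sum(tab[, 2]),  #true positive rate
    specificity = tab[1, 1] / sum(tab[, 1]))  #true negative rate
}

#Example with the LDA confusion table from above
lda.tab <- as.table(matrix(c(2484, 879, 1106, 1703), nrow = 2,
                           dimnames = list(predicted = c(0, 1),
                                           actual    = c(0, 1))))
round(classification_metrics(lda.tab), 3)
```

This reproduces the 67.8% / 60.6% / 73.9% figures from the LDA chunk and can fill in the tables below as we run each model.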
Statistics for the best version of each data mining method
[fill this in once we’ve run the models – this is just a template]
| Model for any recidivism | Accuracy | Sensitivity | Specificity |
|---|---|---|---|
| Logistic regression | | | |
| Linear Discriminant Analysis (LDA) | | | |
| Quadratic Discriminant Analysis (QDA) | | | |
| Naive Bayes (NB) | | | |
| Random Forest (RF) | | | |

| Model for violent recidivism | Accuracy | Sensitivity | Specificity |
|---|---|---|---|
| Logistic regression | | | |
| Linear Discriminant Analysis (LDA) | | | |
| Quadratic Discriminant Analysis (QDA) | | | |
| Naive Bayes (NB) | | | |
| Random Forest (RF) | | | |
[Let’s add some graphs here that show tradeoffs between different measures that we care about, like sensitivity vs. specificity. Caulkins would be proud.]
2c: Demographic analysis of the best version of each model:
After choosing the best version of each model, we assessed the potential for prejudice. Our goal was to replicate, for each of these metrics, ProPublica's chart included in the introduction. [If possible, maybe create some charts comparing models on accuracy vs. prejudice tradeoffs]
2c1: Race
2c2: Gender
2c3: Age
Our final model is…. This has an accuracy rate of [] and a specificity of []. We chose it because…
#Add pretty figures about our final model(s)
3a.1: Do our RAIs perform better or worse than COMPAS?
3a.2: Do our RAIs produce similar classifications to COMPAS?
3a.3: Are there systematic differences between our classifications and those of COMPAS?
Perhaps cite some articles here, like the two that ProPublica cited, or something more recent: the Columbia article on risk as a proxy for race (2010) and the article on Risk, Race, and Recidivism.